Linear regression model selection using p-values 
when the model dimension grows 

By Piotr Pokarowski, Jan Mielniczuk and Pawel Teisseyre

Abstract. We consider a new criterion-based approach to model selection in linear regression. Properties of selection criteria based on p-values of a likelihood ratio statistic are studied for families of linear regression models. We prove that such procedures are consistent, i.e. the minimal true model is chosen with probability tending to 1, even when the number of models under consideration slowly increases with the sample size. The simulation study indicates that the introduced methods perform promisingly when compared with Akaike and Bayesian Information Criteria.

Keywords: model selection criterion; random or deterministic design linear model; p-value based methods;
Akaike Information Criterion; Bayesian Information Criterion.



1 Introduction 

We reconsider a problem of model choice for a linear regression

$$Y = X\beta + \varepsilon, \qquad (1)$$

where $Y$ is an $n \times 1$ vector of observations whose variability we would like to explain, $X$ is an $n \times M_n$ design matrix consisting of vectors of $M_n$ potential regressors collected from $n$ objects and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ is an unknown vector of errors, assumed to have $N(0, \sigma^2 I)$ distribution. Vector $\beta = (\beta_1, \ldots, \beta_{M_n})'$ is an unknown vector of parameters. In the paper we will consider the cases corresponding to experimental and observational data when rows of $X$ are either deterministic or random. Suppose that some covariates are unrelated to the prediction of $Y$, so that the corresponding coefficients $\beta_i$ are zero. It is assumed that the true model is a submodel of (1). As it is not a priori known which variables are significant, in order to make the last assumption realistic it is natural to let the horizon $M_n$ grow with $n$ and allow in this way potentially large models.

Model selection is a core issue of statistical modeling. In a framework of linear regression the problem has been intensively studied under various conditions imposed on design matrix $X$ and growth of $M_n$. The aim of such procedures is to choose the most parsimonious model describing adequately a given data set. For the review of these advances we refer to Potscher and Leeb (2008). The main problem here is a modeler's dilemma that underfitting leads to omission of important variables in the model whereas overfitting involves unnecessary parameter estimation for redundant coefficients which lessens the precision of the model fit. In the article we contribute to a line of research in which the chosen model is the maximiser of a chosen criterion function. In a seminal paper which is typical for this approach Akaike (1970), starting with the idea of maximising the expectation of predictive likelihood, has shown that the usual likelihood has to be modified to obtain an unbiased estimator of the expectation. The likelihood modified in such a way is known as Akaike Information Criterion (AIC). A variety of other modifications of the likelihood followed, with Bayes Information Criterion (BIC) being the most frequently used competitor. Recently, Pokarowski and Mielniczuk (2010) introduced model selection criteria mPVC and MPVC based on p-values of a likelihood ratio statistic for families of linear models with deterministic covariates and constant dimension. The idea in the case of minimal p-value criterion mPVC is to consider the model selection problem from a point of view of testing a certain null hypothesis $H_0$ against several hypotheses $H_i$ and to choose the hypothesis (the model) for which the null hypothesis is most strongly rejected in its favour. The decision in the case of mPVC is based on a new criterion which is the minimal p-value of the underlying test statistics. We stress that the discussed selection method is based on a completely different paradigm than the existing approaches: instead of penalizing the likelihood ratio statistic directly by subtracting a complexity penalty, an appropriate function of it is chosen as a selection criterion.

We study conditions under which such a rule is consistent, i.e. it chooses the minimal true model with probability tending to 1 when the sample size increases. Our main theoretical result stated in Theorem 1 asserts that this property holds for the minimal p-value criterion mPVC provided $M_n$ increases at a slower rate than $\log n + a_n$, where $a_n$ are weights appearing in the scaling of p-values. A similar result is proved for maximal p-value criterion MPVC. Both results apply also to the case when $M_n$ is constant provided the full model is correctly specified. We also introduce and investigate less computationally demanding greedy versions of the discussed methods.

In the last section we present the results of a limited simulation study which shows that the introduced methods perform on average better than AIC and BIC criteria. In particular, their performance measured by probability of correct subset detection and prediction error is much more stable when the length $M_n$ of the list of models increases, i.e. the regression model becomes sparse.

In the paper we focus mainly on explanation, i.e. finding the model which adequately describes the data. Besides the immediate application of model selection methods to the second main task of prediction let us mention their use in construction of data-adaptive smooth tests (see e.g. Ledwina (1994)).

The problem of linear model selection when the number of possible predictors increases with the sample size has been studied from a different angle by Shao (1997), who defined the optimal submodel to be the submodel minimizing the averaged squared prediction error and investigated conditions under which the selected model converges in probability to this model. Moreno et al. (2010) considered a Bayesian approach to this problem and proposed using Bayes factors for intrinsic priors as selection criteria.

The main contribution of the present paper is establishing consistency of the criteria based on p-values when the linear model dimension grows. The result is proved for the random design as well as for the fixed design scenario, the former being treated in detail. Instrumental in the proofs are Lemmas 3, 4 and 5, which can also be useful for different purposes.

The paper is organized as follows. In Section 2 we introduce considered selection criteria. In Section 3 we discuss the imposed assumptions and consistency results for the family of models consisting of all subsets of predictors as well as hierarchic family. We also introduce greedy modifications of the considered criteria. Section 4 contains proofs of the main results and Section 5 discussion of the results of numerical experiments. Proofs of some auxiliary lemmas are relegated to the Appendix.



2 Model Selection criteria for linear regression models based on 
p-values 

We start by explicitly stating the basic assumption we impose on random-design regression model. Assume that the rows $x_1^{(n)\prime}, \ldots, x_n^{(n)\prime}$ of a matrix $X$ ($n \times M_n$) are iid, $x_l = x_l^{(n)} = (x_{l1}^{(n)}, \ldots, x_{lM_n}^{(n)})'$, $l = 1, \ldots, n$. Throughout we consider the situation that the minimal true model is fixed, i.e. it does not change with $n$. Vectors $\{x_1^{(n)}, \ldots, x_n^{(n)}\}$ constitute rows in an array of iid sequences of $M_n$-dimensional random variables. We impose the condition that $M_n$ is nondecreasing and that the law of the first $M_n$ coordinates of $x_1^{(n+1)}$ coincides with that of $x_1^{(n)}$, i.e. the distribution of attributes considered for a certain sample size remains the same for larger sample sizes. We also assume throughout that the second moments of coordinates of $x_1^{(n)}$ are finite for any $n$. As any submodel of (1) containing $p_j$ variables $(x_{lj_1}, \ldots, x_{lj_{p_j}})'$ can be described by a set of indexes $j = \{j_1, \ldots, j_{p_j}\}$, in order to make notation simpler it will be referred to as model $j$. The minimal true model will be denoted by $t$ and $p_t$ will be the number of nonzero coefficients in equation (1). The empty model $Y = \varepsilon$ will be denoted briefly by $0$ and the full model (1) by $f = \{1, \ldots, M_n\}$. Note that $M_n = p_f$. Let $\hat\beta_j = (\hat\beta_{j_1}, \ldots, \hat\beta_{j_{p_j}})'$ be a maximum likelihood (ML) estimator of $\beta$ calculated for the considered model $j$. We denote $\hat\beta_f$, ML estimator in the full model, briefly by $\hat\beta$. Let $\mathcal{M}$ be a certain family of subsets of a set $f$ and $x_{lt} = (x_{lt_1}^{(n)}, \ldots, x_{lt_{p_t}}^{(n)})'$ be a vector of variables which pertain to the minimal true model $t$. Throughout this paper, with the exception of Section 3.2, we will impose the following assumption:

(A0) $E(x_{1t}x_{1t}')$ is a positive definite matrix.

The main objective of model selection is to identify the minimal true model $t$ using data $(X, Y)$. Let $f_{\beta_j, \sigma^2}(Y|X)$ be the conditional density of $Y$ given $X$. Consider two models $j$ and $k$ where the first model is nested within the second model. Denote by $D^n_{jk}$ the likelihood ratio test (LRT) statistic, based on conditional densities given $X$, for testing $H_0$: model $j$ is adequate against hypothesis $H_1$: model $k$ is adequate whereas $j$ is not, equal to

$$D^n_{jk} = 2\log\frac{f_{\hat\beta_k, \hat\sigma_k^2}(Y|X)}{f_{\hat\beta_j, \hat\sigma_j^2}(Y|X)}, \qquad (2)$$



where $\hat\sigma_j^2 = RSS(j)/n$ and $RSS(j)$ is a sum of squared residuals from the ML fit of the model $j$. We recall that ML estimator $\hat\beta_k$ coincides with Least Squares estimator of $\beta$. When $j$ and $k$ are linear models it turns out that LRT statistic is given explicitly by

$$D^n_{jk} = -n\log\frac{RSS(k)}{RSS(j)} = -n\log(1 - R^n_{jk}),$$

where

$$R^n_{jk} = \frac{RSS(j) - RSS(k)}{RSS(j)} \qquad (3)$$

is the coefficient of partial determination of variables belonging to $k \setminus j$ given that variables in set $j$ are included in the model. Under the null hypothesis $H_0$ it follows from Cochran's theorem (cf. e.g. Section 5.5 in Rencher and Schaalje (2008)) that given $X$, $RSS(j) \sim \sigma^2\chi^2_{n-p_j}$ and $R^n_{jk} \sim Beta\left(\frac{p_k - p_j}{2}, \frac{n - p_k}{2}\right)$, provided $X$ is of full column rank.
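
The relation between the LRT statistic and the coefficient of partial determination can be illustrated by a minimal sketch. The function names `partial_r2` and `lrt_statistic` are hypothetical (they do not come from the paper), and the residual sums of squares are assumed to have been computed beforehand from least squares fits of the nested models $j$ and $k$.

```python
import math


def partial_r2(rss_j, rss_k):
    """Coefficient of partial determination (3) for model j nested in model k.

    rss_j and rss_k are residual sums of squares; nesting implies rss_k <= rss_j.
    """
    if not (0.0 < rss_k <= rss_j):
        raise ValueError("expected 0 < RSS(k) <= RSS(j) for nested models")
    return (rss_j - rss_k) / rss_j


def lrt_statistic(rss_j, rss_k, n):
    """LRT statistic -n * log(RSS(k) / RSS(j)) for sample size n."""
    if not (0.0 < rss_k <= rss_j):
        raise ValueError("expected 0 < RSS(k) <= RSS(j) for nested models")
    return -n * math.log(rss_k / rss_j)


# Illustrative numbers only: halving the RSS gives a partial R of 0.5.
r = partial_r2(10.0, 5.0)
d = lrt_statistic(10.0, 5.0, n=100)
```

Both forms of the statistic agree, since $RSS(k)/RSS(j) = 1 - R^n_{jk}$; the statistic is therefore a monotone function of the partial determination coefficient.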

Let $F$ and $G$ be univariate cumulative distribution functions and $T$ be a test statistic which has distribution function $G$ not necessarily equal to $F$. Let $p(t|F) = 1 - F(t)$. By p-value of a test statistic $T$ given distribution $F$ (null distribution) we will mean $p(T|F)$. We will consider p-values of statistic $R^n_{jk}$ given Beta distribution with shape parameters $\frac{p_k - p_j}{2}$ and $\frac{n - p_k}{2}$. In order to make notation simpler $p(R^n_{jk}|Beta(\frac{p_k - p_j}{2}, \frac{n - p_k}{2}))$ will be denoted as $p(R^n_{jk}|p_k, p_j)$. We define the following model selection criteria based on p-values of statistic $R^n_{jk}$ when one of the indices is held fixed and the other ranges over all potential models.

Minimal p-value Criterion (mPVC)

$$\hat M^n_m = \mathrm{argmin}_{j \in \mathcal{M}}\, e^{p_j a_n} p(R^n_{0j}|p_j, 0),$$

where $p(R^n_{00}|0, 0) = e^{a_n}/\sqrt{n}$ and $(a_n)$ is a sequence of nonnegative numbers. When a minimizer is not unique, the set with the smallest number of elements is chosen. In the case of ties, an arbitrary minimizer is selected. Observe that when $a_n = 0$ then from among the pairs $\{(H_0, H_j)\}$ we choose a pair for which we are most inclined to reject $H_0$ and we select the model corresponding to the most convincing alternative hypothesis. For positive $a_n$ the scaling factor $e^{p_j a_n}$ is interpreted as additional penalization for the complexity of a model.
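
As an illustration of the mPVC rule, here is a minimal standard-library sketch. It is not the authors' implementation: the names `reg_inc_beta`, `beta_sf` and `mpvc_select` are hypothetical, the Beta tail probability is evaluated with a textbook continued-fraction expansion of the regularized incomplete beta function, and each candidate model is assumed to be summarised by its number of variables $p_j$ and its coefficient of determination against the empty model.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) cdf."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
              + a * math.log(x) + b * math.log(1.0 - x))
    bt = math.exp(log_bt)
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def beta_sf(a, b, x):
    """Tail probability P(B > x) for B ~ Beta(a, b), via I_{1-x}(b, a)."""
    return reg_inc_beta(b, a, 1.0 - x)


def mpvc_select(candidates, n, a_n=0.0):
    """Pick the label minimising exp(p_j * a_n) * p-value.

    candidates: list of (label, p_j, r_0j); p_j == 0 marks the empty model,
    whose criterion value is set by convention to exp(a_n) / sqrt(n).
    Ties are broken in favour of the model with fewer variables.
    """
    best_key, best_label = None, None
    for label, p, r in candidates:
        if p == 0:
            value = math.exp(a_n) / math.sqrt(n)
        else:
            value = math.exp(p * a_n) * beta_sf(p / 2.0, (n - p) / 2.0, r)
        key = (value, p)
        if best_key is None or key < best_key:
            best_key, best_label = key, label
    return best_label
```

A call such as `mpvc_select([("empty", 0, 0.0), ("x1", 1, 0.5), ("x1+x2", 2, 0.51)], n=100, a_n=0.5 * math.log(100))` compares the scaled p-values of the three candidates; the numbers are made up for illustration. Where SciPy is available, `scipy.stats.beta.sf` could replace the hand-written tail function.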
Moreover, Maximal p-value Criterion is defined as follows.

Maximal p-value Criterion (MPVC)

$$\hat M^n_M = \mathrm{argmax}_{j \in \mathcal{M}}\, e^{-p_j a_n} p(R^n_{jf}|M_n, p_j),$$

where $p(R^n_{ff}|M_n, M_n) = 1$ and $a_n \to \infty$. Thus from among the pairs $\{(H_j, H_f)\}$ we choose a pair for which we are most reluctant to reject $H_j$ in favour of the full model hypothesis. We stress that the additional assumption $a_n \to \infty$ needed for consistency of MPVC is not required to prove consistency of mPVC. This point is discussed further in Section 3. Note that in the definition of both criteria the existence of an encompassing model, either from below or from above, is vital for the construction. The idea of encompassing has been used in Bayesian model selection (see e.g. Casella et al. (2009)).
Observe that for a fixed number of variables $p_j$ the p-value $p(R^n_{0j}|p_j, 0)$ is a strictly decreasing function of $R^n_{0j}$. Thus the set $\hat M^n_m$ is actually chosen from among subsets for which $R^n_{0j}$ is maximal for the stratum $p_j = 1, \ldots, M_n$. The same observation also holds for MPVC as well as for BIC and AIC. Observe also that if these criteria choose subsets of the same cardinality, these subsets necessarily coincide.

3 Results



3.1 Random-design regression 

The main result of this section is consistency of the introduced selectors. Depending on the context we will use some of the following additional conditions on the horizons $M_n$, norming constants $a_n$ and matrix $X$.

(A1.1') $M_n/(a_n + \log(n)) \to 0$ as $n \to \infty$.

(A1.1'') $M_n/a_n \to 0$ as $n \to \infty$.

(A1.2) $\lim_{n\to\infty} M_n \ge \max_{i \in t} i =: i_{max}$.

(A1.3) The minimal eigenvalue $\kappa_n$ of $E[x_1^{(n)} x_1^{(n)\prime}]$ is bounded away from zero, i.e. $\kappa_n \ge \kappa > 0$ for some $\kappa > 0$ and $n \in N$.

(A1.4) For some $\eta > 0$, $n^{-1}M_n^{1+\eta} \to 0$ and

$$\sup_n \sup_{\|d\|=1} E|d'z^{(n)}|^{4\lceil 2/\eta \rceil} < \infty, \qquad (4)$$

where $z^{(n)} = E[x_1^{(n)} x_1^{(n)\prime}]^{-1/2} x_1^{(n)}$ is the standardised vector $x_1^{(n)}$, i.e. $E(z^{(n)} z^{(n)\prime}) = I$, and $\lceil 2/\eta \rceil$ is the smallest integer greater than or equal to $2/\eta$.

(A1.5) $a_n/n \to 0$ as $n \to \infty$.

Assumptions (A1.1') and (A1.1'') are two variants of the condition on a rate of divergence of $M_n$. As $M_n$ is nondecreasing, the limit in (A1.2) exists and is either finite or equal to infinity. Condition (A1.2) is a natural condition stating that ultimately the list will contain the true model. The assumptions (A1.3) and the second part of (A1.4), used in Zheng and Loh (1997), imply in particular that with probability tending to one $(X'X)^{-1}$ exists and therefore $\hat\beta$ is unique. Similar conditions are used by Mammen (1993) to study the asymptotic behaviour of bootstrap estimators of contrasts in linear models of increasing dimension. We will consider in detail the case when $\hat M^n_m$ and $\hat M^n_M$ are optimised over all subsets of $f$, i.e. $\mathcal{M} = 2^f$, and comment on the situation when the nested list of models is considered: $\mathcal{M}_{nested} = \{\{1, 2, \ldots, i\}\}_{i=1,\ldots,M_n}$. The first result concerns consistency of the minimal p-value criterion.

Theorem 1 Let $\mathcal{M} = 2^f$. Then under conditions (A0), (A1.1'), (A1.2), (A1.3), (A1.4), (A1.5)
$P(\hat M^n_m = t) \to 1$, as $n \to \infty$.

As follows from the proof and Lemma 4, condition (A1.1') may be weakened in Theorem 1 to $(a_n + \log n - M_n)/\sqrt{M_n} \to \infty$. We now state an analogous result for MPVC criterion.

Theorem 2 Let $\mathcal{M} = 2^f$. Then under conditions of Theorem 1 with (A1.1') replaced by (A1.1'')
$P(\hat M^n_M = t) \to 1$, as $n \to \infty$.

In order to compare assumptions of the above results note that when $M_n$ grows more slowly than $\log(n)$ we can take $a_n = 0$ in the case of criterion $\hat M^n_m$. However, in the case of $\hat M^n_M$ the assumption (A1.1'') is obviously not satisfied for $a_n = 0$.

It follows from the proof that the condition (A1.1'') may be weakened in Theorem 2 to $(a_n - M_n)/\sqrt{M_n} \to \infty$.

Proofs of Theorems 1 and 2 are given in Section 4.

Consider now the case when the criteria are optimised over nested list of models $\mathcal{M}_{nested} = \{\{1, 2, \ldots, i\}\}_{i=1,\ldots,M_n}$ and define $i_{max} = \max_{i \in t} i$ as the largest index of nonzero coefficient in the true model. In this case our goal is not to identify consistently the minimal true model $t$ but rather $i_{max}$, which is equivalent to consistent selection of a set $t_{max} = \{1, \ldots, i_{max}\}$. It turns out that this property holds under weaker conditions than in Theorems 1 and 2. Namely, the conditions (A1.3) and (A1.4) can be omitted. In this case the condition (A0) will be slightly modified. Let $x_{lt_{max}} = (x_{l1}^{(n)}, \ldots, x_{li_{max}}^{(n)})'$ be a vector of variables which pertain to the model $\{1, \ldots, i_{max}\}$. Instead of (A0) we assume (B0): $E(x_{1t_{max}} x_{1t_{max}}')$ is a positive definite matrix. Then under conditions (B0), (A1.1'), (A1.2) and (A1.5) $P(\hat M^n_m = t_{max}) \to 1$ and an analogous result holds for $\hat M^n_M$ provided (A1.1') is replaced by (A1.1''). This is proved along the lines of the proofs of Theorems 1 and 2.



In order to lessen computational burden of all subset search we propose two-step model selection with the first step consisting in initial ordering of variables according to p-values of coefficient of partial determination (3). This method is analogous to the procedure proposed in Zheng and Loh (1997) in which variables are ordered according to absolute values of t-statistics corresponding to respective attributes. Then in the second step an arbitrary criterion Crit is optimised over nested family of models. Specifically, the greedy procedure consists of the following steps. Let

$$PV_i = p(R^n_{(f-\{i\})f}|M_n, M_n - 1), \quad i = 1, \ldots, M_n \qquad (5)$$

be the p-value of statistic $R^n_{(f-\{i\})f}$ for testing $H_0$: model $f - \{i\}$ against $H_1$: model $f$. Then

(Step 1) Order the p-values in nondecreasing order $PV_{i_1} \le PV_{i_2} \le \ldots \le PV_{i_{M_n}}$.

(Step 2) Consider the nested family $\{\{i_1, i_2, \ldots, i_k\}\}_{k=1,\ldots,M_n}$ and optimise criterion Crit over this family.

It can be shown that under (A1.2)-(A1.4)

$$\lim_{n\to\infty} P\left(\max_{i \in t} PV_i < \min_{i \notin t} PV_i\right) = 1.$$



The proof of the above assertion is a simple consequence of Theorem 2 in Zheng and Loh (1997). This, together with Theorems 1 and 2 for the case of the nested list of models, when minimal or maximal p-value criterion is considered as Crit, leads to the following corollary.

Corollary 1 Under conditions of Theorems 1 and 2 respectively the greedy versions of mPVC and MPVC procedures are consistent.

Observe that since parameters of beta distribution used to calculate p-values in (5) do not change with $i$, the ordering in the first step is equivalent to ordering with respect to values of $R^n_{(f-\{i\})f}$, or to the ordering with respect to absolute values of t-statistics when the full model is fitted.
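
The two greedy steps can be sketched as follows. This is only an illustration under stated assumptions: the p-values $PV_i$ are taken as already computed, variables are indexed from 0 rather than 1, `crit` is a hypothetical user-supplied function that is to be minimised over index sets, and the function names are not taken from the paper.

```python
def greedy_order(pvalues):
    """Step 1: variable indices sorted by nondecreasing p-value."""
    return sorted(range(len(pvalues)), key=lambda i: pvalues[i])


def nested_family(order):
    """Step 2 candidates: (i_1,), (i_1, i_2), ..., (i_1, ..., i_M)."""
    return [tuple(order[:k]) for k in range(1, len(order) + 1)]


def greedy_select(pvalues, crit):
    """Minimise crit over the nested family induced by the p-value ordering.

    crit maps a tuple of variable indices to a number; for a criterion that
    is maximised (such as MPVC) pass its negative instead.
    """
    return min(nested_family(greedy_order(pvalues)), key=crit)


# Made-up p-values for four variables and a toy criterion that favours
# exactly two variables, used only to show the mechanics.
pv = [0.4, 0.001, 0.2, 0.0001]
chosen = greedy_select(pv, crit=lambda s: abs(len(s) - 2))
```

Only $M_n$ nested candidates are examined instead of all $2^{M_n}$ subsets, which is the source of the computational saving.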

3.2 Deterministic-design regression 

In this section we will briefly discuss the case when the design matrix $X$ is nonrandom. We allow that the values of attributes $x_{l1}^{(n)}, \ldots, x_{lM_n}^{(n)}$ of the $l$th observation may depend on $n$. Recall that $x_{lt} = x_{lt}^{(n)}$ is a vector of variables which pertain to the minimal true model $t$. In the case of all subset search we replace condition (A0) by the following assumption

(C0) $n^{-1}\sum_{l=1}^{n} x_{lt}x_{lt}' \to W$, as $n \to \infty$, where $W$ is a positive definite matrix.

In the case of random covariates the above convergence in probability follows from the Law of Large Numbers. We also replace conditions (A1.3) and (A1.4) by the following assumption

(C1) The minimum eigenvalue $\kappa_n$ of $n^{-1}X'X$ is bounded away from zero, i.e. $\kappa_n \ge \kappa > 0$ for some $\kappa > 0$ and $n \in N$.

Recall that $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_{M_n})'$ is the least squares estimator based on the full model $f$. Let $T_i = \hat\sigma^{-1}\hat\beta_i[(X'X)^{-1}_{ii}]^{-1/2}$ be the corresponding t-statistic. It can be easily shown that $\hat\sigma T_i = \beta_i[(X'X)^{-1}_{ii}]^{-1/2} + O_P(1)$, for $i \in t$. Thus by assumption (C1) $P(\hat\sigma|T_i| > Cn^{1/2}) \to 1$ as $n \to \infty$, for some $C > 0$. This implies the conclusion of Lemma 5 in Section 4, namely that for $i \in t$ with probability tending to one $\log[RSS(f - \{i\})/RSS(f)]$ is bounded away from 0. As (A1.3) and (A1.4) are used in the random-design case only to prove Lemma 5 it follows that the analogous results to Theorem 1 and Theorem 2 hold for the deterministic-design case.

Corollary 2 Under conditions (C0), (A1.1'), (A1.2), (C1), (A1.5)
$P(\hat M^n_m = t) \to 1$, as $n \to \infty$.

Corollary 3 Under conditions of Corollary 2 with (A1.1') replaced by (A1.1'')
$P(\hat M^n_M = t) \to 1$, as $n \to \infty$.

Consider the case of nested family search. Recall that $x_{lt_{max}}$ is a vector of variables which pertain to the model $\{1, \ldots, i_{max}\}$. If condition (B0) is replaced by the following assumption

(D0) $n^{-1}\sum_{l=1}^{n} x_{lt_{max}}x_{lt_{max}}' \to W$, as $n \to \infty$, where $W$ is a positive definite matrix,

then results discussed at the end of Section 3.1 hold for deterministic design.

4 Proofs



We first state auxiliary lemmas which will be used in the proof of Theorem 1. The first one, proved in Pokarowski and Mielniczuk (2010), gives an approximation of tail probability function of beta distribution. Let $B_{a,b}$ be a random variable having beta distribution with shape parameters $a$ and $b$ and $B(x, y)$ denote beta function. Define an auxiliary function



1 — a + (a + b)x ' 
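for $a, b, x \in R$ such that $x \ne (a - 1)/(a + b)$.

Lemma 1 Assume $x > \frac{a-1}{a+b}$. Then for $a \ge 1$

$$\frac{(1-x)^b x^{a-1}}{B(a,b)\,b} \le P[B_{a,b} > x] \le \frac{(1-x)^b x^{a-1}}{B(a,b)\,b}\,(1 + L(a,b,x)) \qquad (6)$$

and for $a < 1$

$$\frac{(1-x)^b x^{a-1}}{B(a,b)\,b}\,(1 + L(a,b,x)) \le P[B_{a,b} > x] \le \frac{(1-x)^b x^{a-1}}{B(a,b)\,b}. \qquad (7)$$
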
The following Lemma states simple but useful inequalities for gamma function.

Lemma 2 Let $a = p/2$ and $b = (n - p)/2$, for some $p, n \in N$. Then

$$\Gamma(b)\,b^a \le \Gamma(a + b) \le \frac{2}{\sqrt{\pi}}\,\Gamma(b)(a + b)^a.$$

The above Lemma implies an inequality for beta function $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$

$$\frac{b^{a-1}}{\Gamma(a)} \le \frac{1}{b\,B(a,b)} \le \frac{2}{\sqrt{\pi}}\,\frac{(a + b)^a}{b\,\Gamma(a)}, \qquad (8)$$

for $a = p/2$, $b = (n - p)/2$ and $p, n \in N$.



Remark 1 Lemma^ easily implies inequality T(p/2) < (\p/2~\ — 1)! < p v l 2 for p > 1, which will be 
frequently used throughout. 

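The following Lemma states that for a proper submodel of the true model $t$ variance estimator is asymptotically biased; $j \subset k$ denotes a proper inclusion of $j$ in $k$.

Lemma 3 (i) For $j \supseteq t$, $j \in \mathcal{M}$, $\frac{RSS(j)}{n} \xrightarrow{P} \sigma^2$ as $n \to \infty$. Moreover, for $j \subset t$, $j \in \mathcal{M}$, if (A0) is satisfied then $\frac{RSS(j)}{n} \xrightarrow{P} \sigma^2 + \lambda_j$, as $n \to \infty$, where $\lambda_j > 0$.

(ii) Let $j \subset t_{max}$, $j \in \mathcal{M}_{nested}$ and assume (B0). Then $\frac{RSS(j)}{n} \xrightarrow{P} \sigma^2 + \lambda_j$ as $n \to \infty$, where $\lambda_j > 0$.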

Lemma 4 Let $R_n$ be a sequence of real numbers such that $(R_n - M_n)/\sqrt{M_n} \to \infty$ as $n \to \infty$. Assume also that $M_n/n \to 0$ and matrix $X'X$ is invertible with probability tending to 1. Then

$$P\left\{n\log\left[\frac{RSS(t)}{RSS(f)}\right] > R_n\right\} \to 0 \quad \text{as } n \to \infty.$$



Remark 2 Observe that as $(R_n - M_n)/\sqrt{M_n} = \sqrt{M_n}(R_n/M_n - 1)$, the imposed condition on $R_n$ is implied by $R_n/M_n \to \infty$. Thus in particular Lemma 4 implies that

$$\frac{RSS(t)}{RSS(f)} = O_P\left(\exp\left(\frac{R_n}{n}\right)\right)$$

for any $R_n$ such that $R_n/M_n \to \infty$. Observe moreover that Lemma 4 holds true also in the case $M_n = M$ when the condition on $R_n$ reduces to $R_n \to \infty$ only and thus $RSS(t)/RSS(f) = O_P(\exp(n^{-1}))$. This can be seen directly from Lemma 3 and the fact that $RSS(t) - RSS(f) \sim \sigma^2\chi^2_{M - p_t}$, as it follows from them that $R^n_{tf} = O_P(n^{-1})$ and thus $n\log(RSS(t)/RSS(f)) = O_P(1)$.



Lemma 5 Assume conditions (A1.3) and (A1.4). Then there exists $a > 0$ such that

$$P\left\{\min_{i \in t}\log\left[\frac{RSS(f - \{i\})}{RSS(f)}\right] > a\right\} \to 1 \quad \text{as } n \to \infty.$$

Thus Lemma 5 implies that with probability tending to 1 $\log[RSS(f - \{i\})/RSS(f)]$ for $i \in t$ is bounded away from 0.



4.1 Proof of Theorem 1

We will consider separately two cases: the first when the true model $t$ contains nontrivial regressors ($p_t \ge 1$) and the second, when it equals the null model.

Case 1 ($p_t \ge 1$). We will treat the case $p_t \ge 2$ in detail, the case $p_t = 1$ is similar but simpler and relies on (7) instead of (6) to treat $p(R^n_{0t}|p_t, 0)$.

(i) Let $j$ be such that $j \supset t$, i.e. $t$ is a proper subset of $j$. We will prove that $P[e^{p_t a_n}p(R^n_{0t}|p_t, 0) \ge \inf_{j \supset t} e^{p_j a_n}p(R^n_{0j}|p_j, 0)] \to 0$ as $n \to \infty$. Using (8) with $a = p_t/2$ and $b = (n - p_t)/2$ we obtain the following inequalities for sufficiently large $n$



T>(Pt n-pt 

"I 2 ' 2 



)( 



n—pt 
2 . 



< 



2(f)' 



< 



2(§r 



4(f) 



Et-1 



^mr(f)-^(f)r(f) 



'Pt' 

v 2 . 



(9) 
Moreover for $j \supset t$ and sufficiently large $n$



2 



T3(P± n-Pj 
-"v. 2 ' 2 



Note that 



> 



P J 1 Pt+! i Pt+1 i Pt + 1 i 

f n-M„ \ 2 1 / n-M„ \-S 1 /n\ 2 1 

V 2 / > V 2 / > V 2 / \2J /-ir,\ 

pj_ — p t + l — Pt + i ' \ 1V ) 

M n 2 M n 2 M n 2 



P (inf flfc > sup 1— < P {R n ot > (M n - 2)/n) -+ 1, 
which follows from Lemma [3] and the fact that M n /n — > 0. Thus the assumption of Lemma [l] is satisfied 



for x = R, 



" « — p± h — n ~ p i 

j , U — -j- , U — 2 



and all j D Using we have 



P[e^"p(P£|p t ,0) > inf e^-p^p^O)] < 



P 



P 



> inf 



< 



(1 - P&)^[1 + L (f , ^+Pg t )]e^ > . nf (1 - P^^TO^- 1 ^ 1 ^ 



Taking logarithms and using inequalities (pi), ([To]) we obtain 



(11) 



Ppogp(J^|pt,0) + p t a„ > inf logp(i^|pj,0) + (p* + lR] < P 



where 



W n = On + 5 log Q) -log[l + i 



Pt n-pt 



7?" 
-fin* 



Pt + 1 



1 log 



Pt + 1 



2 ' 2 
log(M„) - log 



n-pt 



M„ 



log 



RSS(t) 
RSS(f) 



1 log(P™ )+ 



logL 



Assumption M n /(a n + log(n)) — > 0, Lemma [3] and the fact that Pn,t — > c 2 > imply that there exists 
a sequence W n of real numbers such that P(W n > W n ) — > 1 and W n /M n — > 00. Now the required 
convergence follows from 



n- p t 



log 



RSS(t) 



RSS(f)_ 

which in turn is implied by Lemma 4.

(ii) Consider now the case $j \not\supseteq t$ and let $i = i_j \in N$ be such that $i \in t \cap j^c$. We will prove that $P[e^{p_t a_n}p(R^n_{0t}|p_t, 0) \ge \inf_{j \not\supseteq t} e^{p_j a_n}p(R^n_{0j}|p_j, 0)] \to 0$ as $n \to \infty$. Define

$$M(n, i) = \max\left\{R^n_{0(f - \{i\})},\ \frac{2M_n}{n - M_n}\right\}$$

for $i \in t$. Assume first that $p_j \ge 2$. Using (6) and (8) we have



eP^- P (R^\ Pj ,0) > e 2a «p(M(n,i)\ Pj ,0) > 



D 



e za ~l 



( pj_ "-Pj \ ( "-Pj \ 

V 2 ' 2 ) \ p 2 ) 



> 



n—M„ \ T 
2 J 



D / Pj "~Pj 

1 2 ' 2 



> 



e 2a "[l - M(n,i)]5¥„ ! 



e 2a "[l-M(n,i)]tM- 1 . 



(12) 



From (6) and (9)



e*"°»p(P£ t |p t ,0) < 



ePt a n{1 _ R n t (*) S -1 [! + L ^Pi )jR » )] 



(13) 



Using (12) and (13) we have for $p_t \ge 2$ and $p_j \ge 2$



P[e P t a„ logp(i? n o) > inf e w Q " logp(^> J , 0)] < P inf - log 



[i€t 2 



(1 - M(n,i))RSS(0) 
RSS(t) 



<S r , 



where 



S„ = a„(p t - 2) + (! - l) log (|) 



log 



' 1 ' If' — 2^ ,R ™ 



io g r 



2 & \RSS(0)J 
-l (Vt- 



log 



+ 



log(M„). 



In view of definition of M(n, i) the last probability can be bounded from above by 

"(1- 5^)^(0) 



P < inf - log 

iet 2 



RSS(t) 



<S n }+Pt -log 



RSS(t) 



<s n y 



The second probability above converges to zero in view of Lemma 3. Consider the first probability. Since



the number of elements of t is finite it suffices show that P 1 § log 
Namely, it is bounded from above by 



RSS(f-{i}) 
RSS(t) 



< S n > — > for any i E t. 

















> 


P \ 


n log 



■ RSS(f~{i}) 
RSS{f) 

: RSS(f-{i}) 
RSS(f) 

RSS(f~{i}Y 
RSS(f) 



log 



RSS(f) 
RSS{t) 

■■■ ->>'„ } + Pi |lo; 



<5„| < 

y \ RSS(J) 
1 [ RSS(t) 
RSS{t) 



RSS(f) 



>S n ). 



(14) 

From assumptions (A1.5) and (A1.1') $S_n/n \xrightarrow{P} 0$ and $S_n/M_n \xrightarrow{P} \infty$, respectively. Thus the convergence to zero of the above two probabilities in (14) follows from Lemma 5 and Lemma 4, respectively. The case $p_j = 1$ is treated analogously.

Consider now the case $p_j = 0$. From (13) we have



P[logp(P£ t |p t ,0) + Pt a n > logp(P£ |0,0)] = P[logp(P£ t |p t ,0) > a n - - log(n) - p t a n ] < 



P 



n-Pt 
2 



log 



RSS(0) 
RSS(t) 



<G r 



(15) 



where 



G n = ( Pt - IK + \ log(n) + (f - l) log ( 
log 



1 + Jfc 



log 



+ logF 



"(I) 



The convergence to zero of the probability in (15) follows from Lemma 3 and assumption (A1.5).

Case 2 ($p_t = 0$), i.e. the true model is the null model. We treat in detail the case $p_j \ge 2$. Define $M(n) = \max\{R^n_{0f}, \frac{2M_n}{n - M_n}\}$. Note that the assumption of Lemma 1 is satisfied for $x = M(n)$, $a = \frac{p_j}{2}$, and $b = \frac{n - p_j}{2}$. Using (6) and (8) we have



e p ^ P (RZ\ Pj ,Q) > e 2a »p(M(n)\ Pj ,0) > 



B 



( P± n-pj \ ( n-p-j \ 
V 2 ' 2 / V 2 / 



> 



B 



( pj_ n ~Pj \ ( "-Pj ^ 

v 2 ' 2 A 2 y 



> 



e 2a rl [ 1 _ M( - n )]^ Mn 2 

EI 
M„ 2 



= e 2an [l-M{n)f^M- 1 . 



(16) 



Using (16) we obtain the following inequality



P[logp(P£ |0,0) > inf logp(J^ J |p J ,0)+2a n ]<P[o n -ilog(n)> inf logp(i^ | Pj ■, 0) + 2a„] < 
" ~ Pl ^ log[l - M(n)] > a„ + - log(n) - log(M n )| < 



2 

n-pt 
2 

n-p t 



log 



log 1 



PS^(0^ 



> a„ + i log(n) - log(M n ) |> 



2M n 
n — M,, 



> a n + - log(n) - log(M„) 



(17) 



From Lemma 4 and the assumption $M_n/(a_n + \log(n)) \to 0$ the first probability in (17) converges to zero. The same assumption implies that the second term is ultimately 0. This completes the proof.

4.2 Proof of Theorem 2



The proof is similar to that of Theorem 1 and splits into two cases: $M_n - p_t \ge 1$ (corresponding to the case $p_t \ge 1$ in the previous proof) and $M_n = p_t$ (corresponding to the former case $p_t = 0$). We give the sketch of the proof only.

Case 1 ($M_n - p_t \ge 1$). We discuss the situation when $M_n - p_t \ge 2$, the remaining case relies on (7) instead of (6). Define $M(n, t) = \max\{R^n_{tf}, \frac{2M_n}{n - M_n}\}$. Note that the assumption of Lemma 1 is satisfied for $x = M(n, t)$, $a = \frac{M_n - p_t}{2}$ and $b = \frac{n - M_n}{2}$. In this case condition $a \ge 1$ is also satisfied. Analogously to the



proof of ( 16 1 we obtain 



p(R? f \M n ,p t ) >p(M(n,t)\M n ,p t ) > [l-M{n,t)f=^M n 



(18) 



(i) Let $j$ be such that $j \supset t$, i.e. $t$ is a proper subset of $j$. We will prove that $P[e^{-p_t a_n}p(R^n_{tf}|M_n, p_t) \le \sup_{j \supset t} e^{-p_j a_n}p(R^n_{jf}|M_n, p_j)] \to 0$ as $n \to \infty$. For $j \supset t$ we have $e^{-p_j a_n}p(R^n_{jf}|M_n, p_j) \le \exp[-(p_t + 1)a_n]$. This inequality also applies to $j = f$. Thus using (18) we obtain the following inequalities



P[e-^p(R^ f \M n ,p t ) < supe-^p(R] f \M n , Pj )} < 
/' ( I U ~ — I log[l - M(n, t)] ~ log(M„) - Pt a n < -{p t + l)a n } < 



P 



2 

n - M, 
2 

n - M, 
2 



log 
log 



RSS(t) 

RSS(f) 

2M n 



1 - 



n - M, 



> a n - log(M„)| + 
> a„ - log(M„; 



The above bound converges to zero in view of the assumption $M_n/a_n \to 0$ and Lemma 4.

(ii) Consider now the case $j \not\supseteq t$ and assume that $p_j \le M_n - 2$ (this corresponds to $p_j \ge 2$ in the previous proof). Let index $i = i_j$ be such that $i \in t \cap j^c$. It follows from Lemma 5 that the assumption of Lemma 1 is satisfied for $x = R^n_{(f - \{i\})f}$, $a = \frac{M_n - p_j}{2}$, and $b = \frac{n - M_n}{2}$. Moreover the same reasoning yields for all



3 t * L V i 
inequalities 



Mn-Pj n-M n 



,Rf_^ijf) < M„ wih probability tending to 1. Using (6 1 we have the following 



e -P^p(R^\M n ,p J )<p(R u _ {l})f \M n ,p J ) < 

n — Al — pj 

I 1 - %-{*})/] ~ s ~ [%-{«})/] ~~ 5 1 " 



fl(^,^) (^) 
„ 2n~ r 



t 1 " 



v 2 , 



1+1 



M„ - n - M„ 



>•%-{<})/ 



< 



(19) 



Thus 



P[e- p * a "p(R? f \M n ,p t ) < snpe- p = a "p(R] f \M n , Pj )} < 
M(n,tj\ RSS(f-{i})
RSW) 



where 



K n = p t a„ + log 



Similarly to the proof of (14) we obtain that the RHS tends to 0. 



The case $p_j > M_n - 2$ is simpler and uses (7) instead of (6).

Case 2 ($M_n = p_t$). Thus $e^{-p_t a_n}p(R^n_{tf}|M_n, p_t) = e^{-M_n a_n}$. Assume $p_j \le M_n - 2$ and let $i = i_j$ be such

2 ' 2 



that i <E j c (~) t. Then using L ( 



M " Pl *=^*-,R jf ) < M n and M (cf dlQb it is easy to establish that 



e-^ a "p(R^ f \M n , Pj ) < p(R (f _ {i})f \M n , Pj ) < [1 - %_«)/] 
Then it follows that 



2n 2 



^r(^) 



-[1 + M n ] 



where 



P[e-^ a »p( J R^|M n ,p t ) < suve-^p(R] f \M n , Pj )} < 



r> { sup ^ 2 Mrt ] log 



ii55(t - {»}) 



Jf„ = M„a„ + log(2/V5r) + ~ bg(n) - lo g r(M„/2) + log(l + M n ) + log(M„). 

The convergence to zero of the above probability follows from Lemma 3 and the assumption $a_n/n \to 0$. The case $p_j > M_n - 2$ is analogous.



5 Numerical experiments 

In this section we study the finite-sample performance of the model selection procedures. We consider
the criteria defined in Section 2: the minimal p-value criterion with $a_n = 0$, which in this section will be called
simply mPVC, and two scaled p-value criteria with empirically chosen scalings, namely the minimal
p-value criterion with $a_n = \log(n)/2$ and the maximal p-value criterion with the same $a_n$, called mPVCcal and
MPVCcal, respectively. As benchmarks we considered the classical criteria based on penalized
log-likelihood, which have the form

$$\operatorname{argmax}_{j \in \mathcal{M}} \{2 \log f_{\hat\beta_j, \hat\sigma_j^2}(Y|X) - p_j C_n\} = \operatorname{argmax}_{j \in \mathcal{M}} \{-n \log[RSS(j)/n] - p_j C_n\}$$

with penalties $C_n = 2$ and $C_n = \log(n)$, which correspond to the Akaike (AIC) and Bayesian (BIC) information
criteria, respectively.
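The penalized criterion above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `rss` and `select_by_penalty`, the toy data and the nested candidate list are hypothetical and are not taken from the paper.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit on the columns `cols`."""
    if len(cols) == 0:
        return float(y @ y)
    Xj = X[:, cols]
    coef = np.linalg.lstsq(Xj, y, rcond=None)[0]
    resid = y - Xj @ coef
    return float(resid @ resid)

def select_by_penalty(X, y, candidates, c_n):
    """Return the candidate j maximizing -n*log(RSS(j)/n) - p_j*c_n."""
    n = len(y)
    scores = [-n * np.log(rss(X, y, list(j)) / n) - len(j) * c_n
              for j in candidates]
    return candidates[int(np.argmax(scores))]

# Toy data (an assumption for illustration): two active regressors out of five.
rng = np.random.default_rng(0)
n, M = 200, 5
X = rng.standard_normal((n, M))
y = X[:, 0] - X[:, 1] + rng.standard_normal(n)

# Hypothetical nested list of candidate models {1}, {1,2}, ..., {1,...,5}.
nested = [tuple(range(k)) for k in range(1, M + 1)]
aic_choice = select_by_penalty(X, y, nested, c_n=2.0)        # C_n = 2
bic_choice = select_by_penalty(X, y, nested, c_n=np.log(n))  # C_n = log(n)
```

Because BIC uses the larger penalty here (log(200) > 2), the model it selects can never have more variables than the one selected by AIC.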

5.1 Simulation experiments 

The simulation experiments were carried out with sample sizes n = 75, 100, 200, 300, 500, 1000, each repeated
N = 500 times. We consider the following lists of models

(M1) $t = \{10\}$, $\beta = 0.2$, $M_n = 30$,

(M2) $t = \{1, 2, 5, 6\}$, $\beta = (0.9, -0.8, -0.4, 0.2)'$, $M_n = 6$,

(M3) $t = \{2, 4, 5\}$, $\beta = (1, 1, 1)'$, $M_n = 5$,

(M4) $t = \{2k + 7 : k = 3, \ldots, 12\}$, $\beta = (1, \ldots, 1)'$, $M_n = 60$.

In all cases $\mathcal{M} = 2^{\{1, \ldots, M_n\}}$. Models M1, M3 and M4 were also considered in Zheng and Loh (1997).
Regressors were generated from the $M_n$-variate zero-mean normal distribution with the (i, j)th entry of the
covariance matrix $\Sigma_X$ equal to $\Sigma_X(i, j) = 0.5^{|i-j|}$. The distribution of $(\varepsilon_1, \ldots, \varepsilon_n)$ was multivariate
standard normal. We considered greedy variants of the selection methods, described in Section 3. Table
1 presents estimated probabilities of correct ordering, i.e. the probabilities that the coordinates corresponding
to nonzero coefficients are placed ahead of the spurious ones. It is seen that for n > 500 for the
models considered a correct ordering is recovered practically always. We assess the effectiveness of the
selection rule in terms of the probability of true model selection $P(\hat{t} = t)$, where $\hat{t}$ is the model selected by
the considered rule, and the mean squared error $E(\|X\beta - X\hat\beta(\hat{t})\|^2)$, where $\hat\beta(\hat{t})$ is the post-model-selection
estimator of $\beta$, i.e. the ML estimator in the chosen model. In the experiments estimates of these measures
calculated as the empirical means of respective quantities were considered. The influence of the sample size 
on the effectiveness of selected rules has been investigated. For models M1, M3 and M4 the criteria MPVCcal
and mPVCcal perform considerably better for all sample sizes considered than mPVC and the commonly used
BIC and AIC (see Figures 1 and 2). In contrast, in the case of model M2 the criterion mPVC works better
than the others. In general, the performance of mPVCcal is similar to that of MPVCcal. The results also indicate
that model M1, with only one significant variable placed at position 10, is the most difficult for selection
among the models considered. This is due to the fact that in this case it is difficult to recover the correct
ordering (see Table 1), especially for small sample sizes. Secondly, the selection criteria seem to work worse
when the number of nuisance covariates is large. For model M1 we also studied the influence of the value
of the true parameter $\beta$. Figure 3 indicates that performance with respect to both measures is much worse for small
values of the parameter. The influence of the size of the list $M_n$ on the effectiveness of selection rules
has also been investigated. Figure 4 shows that for model M1 the performance of AIC, BIC and mPVC
is influenced by the choice of the horizon $M_n$, whereas the selection rules MPVCcal and mPVCcal are
the least affected. We also investigated the influence of the strength of the dependence structure of the design
matrix X on the behaviour of selection rules. We studied the cases when the dependence between the
covariates is, respectively, stronger and weaker than in the case described above. Namely, the covariances
$\Sigma_X(i, j) = 0.8^{|i-j|}$ and $\Sigma_X(i, j) = I\{i = j\}$ were considered. For the above cases we also took different
marginal variances of regressors, equal to 0.5 and 2. The error variance $\sigma^2$ was always set to one. The
experiments show that the probability of true model selection is smaller (and the respective prediction error
larger) than in the initial scenario when the dependence is stronger or the variance of covariates is larger. However,
it turns out that the ranking of methods with respect to both considered measures remains the same
in all the above cases. Experiments also indicate that for the considered selection criteria the mean prediction
error behaves approximately as a constant minus the probability of a correct selection.
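The simulation design for model M3 can be sketched as follows. This is an illustration under explicit assumptions rather than the procedure used in the experiments: the greedy step is taken here to be an ordering of variables by squared t-statistics in the full model followed by a search over the resulting nested list, BIC stands in for the selection criterion, and all function names (`ar1_cov`, `fit`, `greedy_bic`, `simulate`) are hypothetical.

```python
import numpy as np

def ar1_cov(m, rho):
    """Covariance matrix with (i, j)th entry rho^|i-j|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def fit(X, y, cols):
    """Least-squares fit on `cols`; returns (full-length coefficients, RSS)."""
    beta = np.zeros(X.shape[1])
    if cols:
        beta[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    resid = y - X @ beta
    return beta, float(resid @ resid)

def greedy_bic(X, y):
    """Order variables by squared t-statistics, then pick the best nested model by BIC."""
    n, m = X.shape
    beta_full, rss_full = fit(X, y, list(range(m)))
    sigma2 = rss_full / (n - m)
    t2 = beta_full ** 2 / (sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    order = [int(i) for i in np.argsort(-t2)]
    best_score, best = -np.inf, []
    for k in range(m + 1):
        cols = sorted(order[:k])
        _, r = fit(X, y, cols)
        score = -n * np.log(r / n) - k * np.log(n)
        if score > best_score:
            best_score, best = score, cols
    return best

def simulate(n=200, n_rep=200, rho=0.5, seed=0):
    """Estimate P(t_hat = t) and E||X beta - X beta_hat(t_hat)||^2 for model M3."""
    rng = np.random.default_rng(seed)
    m, true_cols = 5, [1, 3, 4]            # 0-based version of t = {2, 4, 5}
    beta = np.zeros(m)
    beta[true_cols] = 1.0
    chol = np.linalg.cholesky(ar1_cov(m, rho))
    hits, pred_err = 0, 0.0
    for _ in range(n_rep):
        X = rng.standard_normal((n, m)) @ chol.T   # rows ~ N(0, Sigma_X)
        y = X @ beta + rng.standard_normal(n)      # error variance set to one
        chosen = greedy_bic(X, y)
        beta_hat, _ = fit(X, y, chosen)
        hits += chosen == true_cols
        pred_err += float(np.sum((X @ beta - X @ beta_hat) ** 2))
    return hits / n_rep, pred_err / n_rep

prob_correct, mean_pred_err = simulate()
```

Replacing `rho=0.5` by `rho=0.8` or `rho=0.0` gives the stronger and weaker dependence scenarios mentioned above; the p-value based criteria themselves are not implemented in this sketch.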

We also investigated the case of covariates having different distributions. Namely, we considered the
following regression scenario

$$Y = \beta' L(U) + \varepsilon,$$

where $L(\cdot) = (L_1(\cdot), \ldots, L_{M_n}(\cdot))'$ is a vector consisting of the consecutive orthonormal Legendre polynomials
on [-1, 1] and U is a random variable with the continuous uniform distribution on [-1, 1]. We considered
the following list of models

(L1) $t = \{1, 2, 4\}$, $\beta = (1, 1, 1)'$

with horizons $M_n = 5, 10, \ldots, 25$. The influence of the size of the list $M_n$ has been investigated. The
sample size was set to n = 300. Figure 5 presents the results, which are similar to those of the previous
experiments, indicating that mPVCcal and MPVCcal perform the best in this case, and the second best is
BIC.
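A possible way to build the Legendre design for model (L1) is sketched below. Two points are assumptions and not stated in the text: the polynomials are indexed from degree 1 (so no constant column is included), and they are normalized as $L_k = \sqrt{(2k+1)/2}\,P_k$, which is orthonormal with respect to the Lebesgue measure on [-1, 1]. The function name `legendre_design` is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_design(u, m):
    """n x m matrix with columns L_1(u), ..., L_m(u), L_k = sqrt((2k+1)/2) * P_k."""
    vander = legendre.legvander(u, m)          # columns P_0(u), ..., P_m(u)
    degrees = np.arange(1, m + 1)
    return vander[:, 1:] * np.sqrt((2 * degrees + 1) / 2.0)

rng = np.random.default_rng(1)
n, m = 300, 10                                 # sample size and one horizon M_n
u = rng.uniform(-1.0, 1.0, size=n)
X = legendre_design(u, m)

beta = np.zeros(m)
beta[[0, 1, 3]] = 1.0                          # (L1): t = {1, 2, 4}, beta = (1, 1, 1)'
y = X @ beta + rng.standard_normal(n)          # standard normal errors (assumed)
```

If orthonormality with respect to the uniform distribution of U is intended instead, the scaling factor becomes $\sqrt{2k+1}$.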



5.2 Real data example 

We consider the bodyfat data set (Johnson (1996)) consisting of records of the percentage of fat in the body
(dependent variable) together with 13 independent variables for n = 252 individuals. The two independent
variables having the smallest p-values when the full linear model was fitted were selected. They were
abdomen and wrist circumference, and when used as predictors they resulted in a fitted model with a vector
of estimated coefficients $\hat\beta = (0.7661, -2.8379)'$ and a variance of residuals $\hat\sigma^2 = 4.45$. A parametric
bootstrap (see e.g. Davison and Hinkley (1997)) was employed to check how the considered selection
criteria perform for this data set. Namely, the true model was the fitted linear model with the original
two regressors, $\beta = \hat\beta$ and normal errors with variance equal to $\hat\sigma^2$. Additional superfluous
explanatory variables were created in pairs by drawing from the two-dimensional normal distribution with
independent components, whose mean and variance vectors matched those of the original predictors. We
considered k = 8, 18, ..., 58 additional variables, which amounted to horizons $M_n = 10, 20, \ldots, 60$ when
the true variables were accounted for. Thus $M_n/n$ ranged from about 0.04 to 0.24. In total, 500 parametric bootstrap
samples consisting of 252 observations each were created to mimic the original sample, and the considered
selection criteria were employed to choose a subset of the $M_n$ potential variables. Figure 6 presents the results.
The results are similar to those of the simulation experiments, indicating that mPVCcal and MPVCcal perform
the best in this case, and the second best is BIC.
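The parametric bootstrap scheme described above might be sketched as follows. The data set itself is not reproduced here, so the two predictor columns are hypothetical stand-ins drawn at random; only the coefficients (0.7661, -2.8379), the residual variance 4.45, n = 252 and the horizons come from the text. The absence of an intercept and the function name `bootstrap_sample` are assumptions.

```python
import numpy as np

def bootstrap_sample(x_true, beta_hat, sigma2_hat, n_extra, rng):
    """One parametric bootstrap sample with n_extra superfluous regressors.

    Superfluous variables are drawn in pairs from a two-dimensional normal
    distribution with independent components whose means and variances match
    those of the two original predictors.
    """
    n = x_true.shape[0]
    mean = x_true.mean(axis=0)
    sd = np.sqrt(x_true.var(axis=0, ddof=1))
    pairs = [rng.normal(mean, sd, size=(n, 2)) for _ in range(n_extra // 2)]
    X = np.hstack([x_true] + pairs)
    y = x_true @ beta_hat + rng.normal(0.0, np.sqrt(sigma2_hat), size=n)
    return X, y

rng = np.random.default_rng(4)
n = 252
# Hypothetical stand-ins for the abdomen and wrist circumference columns.
x_true = np.column_stack([rng.normal(92.0, 10.0, n), rng.normal(18.0, 1.0, n)])
beta_hat = np.array([0.7661, -2.8379])
sigma2_hat = 4.45

horizons = [k + 2 for k in range(8, 59, 10)]       # M_n = 10, 20, ..., 60
ratios = [round(m_n / n, 2) for m_n in horizons]   # M_n / n
X, y = bootstrap_sample(x_true, beta_hat, sigma2_hat, n_extra=8, rng=rng)
```

Each selection criterion would then be applied to `(X, y)` for every bootstrap sample and every horizon, and the proportion of samples in which the first two columns are selected exactly estimates the probability of correct selection.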
Table 1: Estimated probability of correct ordering based on N = 500 trials.

Figure 1: Estimated probabilities of correct model selection for models M1 (a), M2 (b), M3 (c) and M4
(d) with respect to n (on a logarithmic scale) based on N = 500 trials.

Figure 2: Means of prediction error for models M1 (a), M2 (b), M3 (c) and M4 (d) with respect to n (on
a logarithmic scale) based on N = 500 trials.




Figure 3: Estimated probabilities of correct model selection (a) and means of prediction error (b) with
respect to the value of parameter $\beta$ for model M1 for sample size n = 300 based on N = 500 trials.

Figure 4: Estimated probabilities of correct model selection (a) and means of prediction error (b) with
respect to $M_n$ for model M1 for sample size n = 1000 based on N = 500 trials.

Figure 5: Estimated probabilities of correct model selection (a) and means of prediction error (b) with
respect to $M_n$ for model (L1) based on N = 500 trials.



Figure 6: Estimated probabilities of correct model selection (a) and means of prediction error (b) with
respect to $M_n$ for the bodyfat data set.

6 Appendix 

Proof of Lemma Q] 

The lemma is proved in Pokarowski and Mielniczuk (2010). For completeness we give an outline of the
proof here. Recall that $B_{a,b}$ and $B(x, y)$ denote a random variable having beta distribution with shape
parameters a and b and the beta function, respectively. Let $B_x(a, b) = \int_0^x t^{a-1}(1 - t)^{b-1}\,dt$ be the incomplete
beta function. It can be easily proved that

$$a B_x(a, b) = x^a (1 - x)^b + (a + b) B_x(a + 1, b), \qquad (20)$$

and

$$B_{1-x}(b, a) = B(a, b) - B_x(a, b). \qquad (21)$$

Consider the case a > 1. Using (20), (21) and the assumption on x we obtain the upper bound in (6)



$$P[B_{a,b} > x] = 1 - \frac{1}{B(a,b)} B_x(a,b) = \frac{1}{B(a,b)} B_{1-x}(b,a) = \frac{1}{B(a,b)\,b}(1 - x)^b x^a \Big[1 + \frac{a+b}{b+1}(1 - x) + \frac{(a+b)(a+b+1)}{(b+1)(b+2)}(1 - x)^2 + \ldots\Big] \le$$



a + b 



B 



(^b-^-^ + bT-^-^^bTi) W + 



(b + l)(b + 2) 
a + b 



{l-x) b x 
B(a,b)b 



-(1 + L(a,b,x)). 
In order to obtain the lower bound in (6) note that for a > 1

1 , ,i „. a + b, „ (a + b)(a + b + 1) . n9 . 1 . >t „ , 

BP'' 1 " *> V[1 + 6+1 (1 - ^ + (fc+V + 2) (1 a) V " 

The case a < 1 can be treated analogously. 
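The identities (20) and (21), and the series expansion of $P[B_{a,b} > x]$ obtained by iterating (20) with the roles of the arguments exchanged, can be checked numerically. The sketch below uses only the standard library; the integration rule, the parameter values a = 2.5, b = 3, x = 0.6 and the function names are illustrative assumptions.

```python
import math

def inc_beta(x, a, b, panels=20000):
    """Incomplete beta function B_x(a, b) by composite Simpson's rule (a, b >= 1)."""
    h = x / panels
    f = lambda t: t ** (a - 1) * (1 - t) ** (b - 1)
    total = f(0.0) + f(x)
    for k in range(1, panels):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3

def beta_fn(a, b):
    """Beta function B(a, b) via the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def tail_series(x, a, b, terms=200):
    """(1-x)^b x^a / (b B(a,b)) * [1 + (a+b)/(b+1) (1-x) + ...], truncated."""
    term, total = 1.0, 1.0
    for i in range(terms):
        term *= (a + b + i) / (b + 1 + i) * (1 - x)
        total += term
    return (1 - x) ** b * x ** a / (b * beta_fn(a, b)) * total

a, b, x = 2.5, 3.0, 0.6
lhs_20 = a * inc_beta(x, a, b)
rhs_20 = x ** a * (1 - x) ** b + (a + b) * inc_beta(x, a + 1, b)
lhs_21 = inc_beta(1 - x, b, a)
rhs_21 = beta_fn(a, b) - inc_beta(x, a, b)
tail_exact = 1 - inc_beta(x, a, b) / beta_fn(a, b)   # P[B_{a,b} > x]
tail_from_series = tail_series(x, a, b)
```

Up to quadrature and truncation error, `lhs_20` agrees with `rhs_20`, `lhs_21` with `rhs_21`, and `tail_exact` with `tail_from_series`.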

For ease of notation we assume in the following proofs that $\sigma^2 = 1$. Let Q(j) denote the projection onto the
column space spanned by the regressors corresponding to coefficients in a given model j.
Proof of Lemma 3

Consider first the case $j \subset t$. Denote $W = E(x_{1,t} x_{1,t}')$, which in view of assumption (A0) is positive definite.
Define $\Delta_{n,j} = n^{-1}(X\beta)'[I - Q(j)](X\beta) \ge 0$. Let $D_j$ be an $M_n \times p_j$ matrix of zeros and ones such that
$XD_j$ consists of only those $p_j$ columns of X which correspond to model j. By assumption (A0) and using
the fact that $X\beta = (XD_t)\tilde\beta$, where $\tilde\beta = (\beta_{t_1}, \ldots, \beta_{t_{p_t}})'$, we have $\Delta_{n,j} \xrightarrow{P} \Delta > 0$ as $n \to \infty$. The assertion
follows from the fact that for $j \subset t$

$$n^{-1}(X\beta)'[I - Q(j)](X\beta) = n^{-1}\tilde\beta' A \tilde\beta, \qquad (22)$$

where

$$A = [(XD_t)'(XD_t)] - [(XD_t)'(XD_t)]\tilde{D}_j[\tilde{D}_j'(XD_t)'(XD_t)\tilde{D}_j]^{-1}\tilde{D}_j'[(XD_t)'(XD_t)]$$

and $\tilde{D}_j$ is a $p_t \times p_j$ matrix such that $XD_j = (XD_t)\tilde{D}_j$. Matrix W, as a positive definite matrix, can be
decomposed as $W = W^{1/2}W^{1/2}$, where $W^{1/2} = US^{1/2}U'$, U is an orthogonal matrix and S is a diagonal
matrix with positive diagonal. The right-hand side of (22) converges in probability to

$$(W^{1/2}\tilde\beta)'[I - W^{1/2}\tilde{D}_j(\tilde{D}_j'W\tilde{D}_j)^{-1}\tilde{D}_j'(W^{1/2})']W^{1/2}\tilde\beta > 0$$

since the columns of $W^{1/2}$ are linearly independent. We have the following decomposition for $j \subset t$

$$n^{-1}RSS(j) = n^{-1}\varepsilon'(I - Q(j))\varepsilon + n^{-1}2(X\beta)'(I - Q(j))\varepsilon + \Delta_{n,j}. \qquad (23)$$



The first summand converges in probability to $\sigma^2$. The last summand $\Delta_{n,j} \xrightarrow{P} \Delta > 0$, as has already been
shown. Provided that X'X is invertible, $n^{-1}2(X\beta)'(I - Q(j))\varepsilon$ given X has the $N(0, v_n)$ distribution, where
$v_n = 4n^{-1}\Delta_{n,j} \xrightarrow{P} 0$. Thus $n^{-1}2(X\beta)'(I - Q(j))\varepsilon \xrightarrow{P} 0$. This completes the first part of the proof. For
$j \supseteq t$ the second and the third term in (23) are equal to zero. This yields the second part of the assertion.
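The decomposition (23) is an exact algebraic identity and can be verified on simulated data. The dimensions, coefficients and the helper name `projection` below are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 4
X = rng.standard_normal((n, m))
beta = np.array([1.0, 0.0, -0.5, 0.0])       # true model t = {1, 3}
eps = rng.standard_normal(n)
y = X @ beta + eps

def projection(X, cols):
    """Projection matrix Q(j) onto the column space of X[:, cols]."""
    Xj = X[:, cols]
    return Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)

def decomposition(cols):
    """Return n^{-1} RSS(j) and the three summands on the right of (23)."""
    R = np.eye(n) - projection(X, cols)
    signal = X @ beta
    rss_term = float(y @ R @ y) / n
    noise_term = float(eps @ R @ eps) / n
    cross_term = float(2 * signal @ R @ eps) / n
    delta = float(signal @ R @ signal) / n
    return rss_term, noise_term, cross_term, delta

sub = decomposition([0])          # j = {1}, a proper subset of t
sup = decomposition([0, 1, 2])    # j = {1, 2, 3}, a superset of t
```

For the submodel all three summands contribute and `delta` is strictly positive, whereas for the supermodel the second and third summands vanish up to rounding error.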
Proof of Lemma 4

Define $b_n = n(\exp(R_n/n) - 1)$. It is easily seen that $b_n > R_n$, thus $b_n$ satisfies the condition imposed on
$R_n$. For $M_n = p_t$ the assertion is obvious, thus we assume that $M_n > p_t$.
We have the following inequality

$$P\Big[n \log\frac{RSS(t)}{RSS(f)} > R_n\Big] = P\Big[\frac{RSS(t)}{RSS(f)} > e^{R_n/n}\Big] = P\{\varepsilon'[Q(f) - Q(t)]\varepsilon > b_n n^{-1}\varepsilon'[I - Q(f)]\varepsilon\}$$

$$\le P\{\varepsilon'[Q(f) - Q(t)]\varepsilon > b_n n^{-1}(n - M_n - d_n)\} + P\{\varepsilon'[I - Q(f)]\varepsilon < n - M_n - d_n\},$$

where $d_n = (n - M_n)^{(1+\delta)/2}$, for some $\delta \in (0, 1)$. Matrix X'X has rank $M_n$ and it follows that
$\varepsilon'[Q(f) - Q(t)]\varepsilon \sim \chi^2_{M_n - p_t}$ and $\varepsilon'[I - Q(f)]\varepsilon \sim \chi^2_{n - M_n}$ (since $\sigma^2 = 1$). By an inequality for cumulative distribution
function of a chi-square distribution,



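for $\delta > 0$ (see Shibata (1981)). Thus we have

$$P\{\varepsilon'[I - Q(f)]\varepsilon < n - M_n - d_n\} \le \exp\Big(-\frac{d_n^2}{4(n - M_n)}\Big) \to 0,$$

as $n \to \infty$, since $M_n/n \to 0$. Let $\gamma_n = b_n(1 - M_n/n - d_n/n)$. As $\varepsilon'[Q(f) - Q(t)]\varepsilon \sim \chi^2_{M_n - p_t}$, by Chebyshev's
inequality we have

$$P\{\varepsilon'[Q(f) - Q(t)]\varepsilon - (M_n - p_t) > \gamma_n - (M_n - p_t)\} \le \frac{2(M_n - p_t)}{[\gamma_n - (M_n - p_t)]^2} \to 0,$$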



where the last convergence follows from $(R_n - M_n)/\sqrt{M_n} \to \infty$. This completes the proof.
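The first step of the proof rewrites the event $\{n \log(RSS(t)/RSS(f)) > R_n\}$ in terms of $b_n = n(\exp(R_n/n) - 1)$. A small numerical sanity check of this equivalence, and of the inequality $b_n > R_n$, is sketched below. The numbers are arbitrary, and the random draws merely stand in for the two quadratic forms; they are not simulated from a regression model.

```python
import math
import random

def b_n(n, r_n):
    """b_n = n (exp(R_n / n) - 1)."""
    return n * (math.exp(r_n / n) - 1.0)

random.seed(0)
n, r_n = 100, 12.0
events_agree = True
for _ in range(1000):
    rss_f = random.uniform(50.0, 150.0)   # stands in for eps'[I - Q(f)]eps = RSS(f)
    gain = random.uniform(0.0, 40.0)      # stands in for eps'[Q(f) - Q(t)]eps
    rss_t = rss_f + gain                  # RSS(t) = RSS(f) + gain
    left = n * math.log(rss_t / rss_f) > r_n
    right = gain > b_n(n, r_n) / n * rss_f
    events_agree = events_agree and (left == right)
```

Since $e^x - 1 > x$ for $x \neq 0$, `b_n(n, r_n)` always exceeds `r_n` when `r_n` is positive, which is why a condition imposed on $R_n$ carries over to $b_n$.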
Proof of Lemma [U 

In view of conditions (A1.3) and (A1.4) matrix $(X'X)^{-1}$ exists with probability tending to one (see the
proof of Theorem 2 in Zheng and Loh (1997)). Recall that $T_k$ is the t-statistic corresponding to the kth
variable. It suffices to prove that for any $c_n \to 0$, $P[\min_{i \in t} \log(RSS(f - \{i\})/RSS(f)) < c_n] \to 0$. Noting
that

$$\frac{RSS(f - \{i\})}{RSS(f)} = \frac{T_i^2}{n - M_n} + 1,$$

we obtain that

$$P\Big[\min_{i \in t} \log\frac{RSS(f - \{i\})}{RSS(f)} < c_n\Big] \le P\big[\min_{i \in t} T_i^2 < (n - M_n)(\exp(c_n) - 1)\big].$$

Since $\exp(c_n) - 1 = O(c_n)$ it suffices to show that $P[\min_{i \in t} T_i^2 < C n c_n] \to 0$, for some C > 0. This
follows from the proof of Theorem 2 in Zheng and Loh (1997), who proved that under the conditions of this
Lemma $P[\min_{i \in t} \hat\sigma^2 T_i^2 < n c_n] \to 0$, for any $c_n$ such that $c_n \to 0$. Now the required convergence follows
from the fact that $\hat\sigma^2 \xrightarrow{P} \sigma^2$.
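The identity linking the residual sums of squares to the t-statistics in the full model, $RSS(f - \{i\})/RSS(f) = T_i^2/(n - M_n) + 1$, can be confirmed numerically. The data below are simulated only for this check, and the helper name `rss` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 80, 6                                   # sample size and M_n
X = rng.standard_normal((n, m))
y = X[:, 0] + rng.standard_normal(n)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return float(resid @ resid)

coef_full = np.linalg.lstsq(X, y, rcond=None)[0]
rss_full = rss(X, y)
sigma2_hat = rss_full / (n - m)                # unbiased variance estimator
std_err = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
t_stats = coef_full / std_err                  # T_i in the full model f

# RSS(f - {i}) / RSS(f) for each variable i removed in turn
ratios = np.array([rss(np.delete(X, i, axis=1), y) / rss_full for i in range(m)])
identity_rhs = t_stats ** 2 / (n - m) + 1.0
```

The arrays `ratios` and `identity_rhs` coincide up to rounding, which is the reason the minimum over $i \in t$ of the log-ratio can be controlled through the smallest squared t-statistic.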